
Add four-nines reliability framework with circuit breakers, DLQ, and monitoring#2

Merged
eighteyes merged 12 commits into main from claude/improve-reliability-four-nines-K38rW
Mar 11, 2026

Conversation


@eighteyes eighteyes commented Mar 11, 2026

Summary

Implements a comprehensive reliability module (src/reliability/) providing four-nines (99.99%) patterns for TX mesh execution. This adds production-grade failure handling, recovery, and monitoring capabilities.

Key Changes

Core Reliability Components:

  • ReliabilityManager (reliability-manager.ts): Central coordinator integrating all reliability features with dispatcher bindings for kill/requeue actions
  • DeadLetterQueue (dead-letter-queue.ts): Session-aware failure recovery with two modes—session resume (preserves conversation context) and message requeue (for undeliverable messages)
  • CircuitBreaker (circuit-breaker.ts): Per-agent failure isolation with closed/open/half-open states, SQLite persistence, and configurable thresholds
  • HeartbeatMonitor (heartbeat-monitor.ts): Detects stalled workers via output silence detection (warn/stale/dead levels) with automatic worker killing
  • SLITracker (sli-tracker.ts): Service level indicator tracking with success rates, MTTR calculation, and failure taxonomy (model_error, routing_error, timeout, stuck, etc.)
  • SafeMode (safe-mode.ts): Gradual autonomy control with four levels (normal/cautious/restricted/lockdown) that restrict tool access based on SLI thresholds
  • CheckpointLog (checkpoint-log.ts): FSM state transition checkpoints for recovery rewinding to named states
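As a rough sketch of the closed/open/half-open lifecycle the CircuitBreaker component provides (thresholds and method names here are illustrative, not the module's exact API; the real implementation also persists state to SQLite):

```typescript
type CircuitState = 'closed' | 'open' | 'half_open';

class CircuitBreakerSketch {
  private state: CircuitState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5, // consecutive failures before opening
    private cooldownMs = 30_000,  // how long to stay open before probing
  ) {}

  canExecute(now = Date.now()): boolean {
    if (this.state === 'open') {
      // After the cooldown, transition to half-open and allow a probe
      if (now - this.openedAt >= this.cooldownMs) {
        this.state = 'half_open';
        return true;
      }
      return false;
    }
    return true; // closed and half_open both allow execution
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    // A failed probe, or too many failures while closed, opens the circuit
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```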

Integration:

  • Dispatcher now initializes ReliabilityManager and binds kill/requeue callbacks
  • New CLI commands: tx mesh health [mesh] (SLI dashboard), tx mesh dlq [mesh] (DLQ entries), tx mesh recover [mesh] (recovery workflow)
  • Configuration via .ai/tx/data/config.yaml with per-component thresholds

Test Meshes:

  • reliability-test/: Simple 3-agent mesh (planner→worker→checker) for testing basic reliability features
  • reliability-fsm/: FSM-based mesh (analyze→build→verify) for testing state machine integration and iteration tracking

Documentation:

  • docs/reliability.md: Comprehensive guide mapping each "nine" to TX mechanisms with feature tables and configuration examples
  • Updated docs/guardrails.md with reliability configuration section
  • Updated src/prompt/core.ts with recovery workflow guidance (diagnose → present options → get approval)

Implementation Details

  • Session-aware recovery: DLQ captures sessionId at failure time, enabling resume that preserves full conversation history rather than raw message replay
  • Persistent circuit breaker state: Checkpoints to SQLite for recovery across restarts
  • Configurable thresholds: All timeouts, failure counts, and SLI escalation points configurable via YAML
  • Failure taxonomy: Categorizes failures (model_error, routing_error, stuck, circuit_open, etc.) for targeted diagnostics
  • Safe mode auto-escalation: Optional automatic restriction of agent capabilities when SLI drops below thresholds
  • Per-mesh isolation: Circuit breakers, safe mode levels, and checkpoints tracked per mesh/agent for failure isolation
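The session-aware recovery decision described above can be sketched roughly as follows (field and type names are illustrative, not the DLQ's exact schema; the real implementation persists entries to SQLite):

```typescript
type FailureCategory =
  | 'model_error' | 'routing_error' | 'timeout' | 'stuck' | 'circuit_open';

type RecoveryMode = 'session_resume' | 'requeue' | 'manual';

interface DlqEntrySketch {
  mesh: string;
  agent: string;
  sessionId?: string;        // captured at failure time, if a session existed
  category: FailureCategory;
  retryCount: number;
  maxRetries: number;
}

// Pick a recovery strategy: resume the session when one was captured
// (preserving conversation history), requeue when retries remain,
// otherwise hold the entry for manual human review.
function chooseRecoveryMode(e: DlqEntrySketch): RecoveryMode {
  if (e.sessionId) return 'session_resume';
  if (e.retryCount < e.maxRetries) return 'requeue';
  return 'manual';
}
```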

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg

claude added 12 commits March 11, 2026 06:21
…t breakers, SLI tracking, and safe mode

Implements Karpathy's "March of Nines" patterns for TX mesh reliability:
- Dead Letter Queue: Failed messages persist for replay instead of silent drops
- Circuit Breaker: Per-agent failure isolation prevents cascading failures
- Heartbeat Monitor: Detects stalled workers at warn/stale/dead thresholds
- SLI Tracker: Measures success rates, MTTR, and failure taxonomy per mesh
- Safe Mode: Gradual autonomy control (normal/cautious/restricted/lockdown)
- ReliabilityManager: Single integration point wired into dispatcher

Includes two test meshes (reliability-test, reliability-fsm) and updated guardrails docs.

… breaker checkpointing

- DLQ: replayOne(), replayAll(), replayForAgent() — re-injects failed messages
  back into the live system through SystemMessageWriter with [DLQ REPLAY] prefix
  and original failure context
- Circuit Breaker: SQLite checkpointing — persists open/half_open circuit states
  to circuit_breaker_checkpoints table, restores on restart so agents that were
  failing before a crash stay circuit-broken
- HeartbeatMonitor: Fix NodeJS.Timeout type to ReturnType<typeof setInterval>
- ReliabilityManager: Expose replayDLQ(), replayDLQEntry(), replayDLQForAgent()
  and pass DB to CircuitBreaker constructor for persistence

…replay

Replace naive message replay with session-aware recovery that preserves
conversation history. DLQ now captures sessionId at failure time and uses
RecoveryMode (session_resume/requeue/manual) to determine the right
recovery strategy. When a worker crashes mid-work with an active session,
recovery resumes the SDK session instead of replaying a raw message.

- Rewrite DLQ schema with session_id, recovery_mode, failure_category
- Update ReliabilityManager with session-aware deadLetter() and recover*() APIs
- Wire dispatcher error handler to capture sessionId and route exhausted
  retries to DLQ with full session context
- Export RecoveryMode, RecoveryResult, FailureContext types from index

Surface reliability internals via CLI:

- tx mesh health [mesh] — SLI nines dashboard with success rate,
  MTTR, failure categories, circuit breaker states, safe mode level,
  agent health, and DLQ summary. Per-mesh/per-agent breakdown when
  mesh name provided.

- tx mesh dlq [mesh] — List pending dead letter queue entries with
  recovery mode (session_resume/requeue/manual), failure context,
  retry counts, and session hints.

- tx mesh dlq clear — Garbage collect recovered DLQ entries.

Both support --json for programmatic consumption.
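The nines/MTTR math behind the health dashboard can be sketched as follows (helper names are assumptions for illustration, not the SLI tracker's real API):

```typescript
// Convert a success rate into a "nines" count, e.g. 0.99 → 2, 0.9999 → 4.
// The small epsilon guards against floating-point error at exact boundaries.
function nines(successRate: number): number {
  if (successRate >= 1) return Infinity;
  return Math.floor(-Math.log10(1 - successRate) + 1e-9);
}

// MTTR: mean time between a failure event and its recovery, in ms.
function mttr(recoveryDurationsMs: number[]): number {
  if (recoveryDurationsMs.length === 0) return 0;
  return recoveryDurationsMs.reduce((a, b) => a + b, 0) / recoveryDurationsMs.length;
}
```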

Every reliability feature now has actuation, not just observation:

Heartbeat dead → kill:
  ReliabilityManager.bindDispatcher() receives killAgent callback.
  When heartbeat fires 'dead', it kills the stuck worker via
  AbortController.abort() and records the failure.
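The silence-threshold escalation can be sketched as follows (thresholds match the defaults documented in this PR; the class shape and callback wiring are illustrative):

```typescript
type HeartbeatLevel = 'ok' | 'warn' | 'stale' | 'dead';

class HeartbeatSketch {
  private lastOutput = new Map<string, number>();

  constructor(
    private warnMs = 60_000,    // no output: log, could be thinking
    private staleMs = 120_000,  // no output: callers may nudge
    private deadMs = 300_000,   // no output: kill + route to DLQ
    private onDead: (agent: string) => void = () => {},
  ) {}

  recordOutput(agent: string, now = Date.now()): void {
    this.lastOutput.set(agent, now);
  }

  // Classify an agent's silence and fire the dead callback when crossed.
  check(agent: string, now = Date.now()): HeartbeatLevel {
    const last = this.lastOutput.get(agent) ?? now;
    const silence = now - last;
    if (silence >= this.deadMs) { this.onDead(agent); return 'dead'; }
    if (silence >= this.staleMs) return 'stale';
    if (silence >= this.warnMs) return 'warn';
    return 'ok';
  }
}
```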

DLQ recovery (3 trigger paths):
  1. Automatic on startup — dispatcher calls recoverAll()
  2. CLI — tx mesh recover <mesh> sends SIGUSR2 to dispatcher
  3. Front-matter — message with recover: true triggers recovery
  Session resume: writes message with session-id front-matter so
  dispatcher spawns worker resuming the SDK conversation.
  Requeue: re-injects original message via SystemMessageWriter.

Safe mode enforcement:
  createSafeModeHook() returns a PreToolUse hook (same pattern as
  write-gate) that blocks Write/Edit/Bash at restricted+ levels.
  Hook is registered per-agent at spawn time.
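A minimal sketch of the tool-blocking gate behind this hook (the real createSafeModeHook() integrates with the agent SDK's PreToolUse mechanism; the exact blocked-tool sets and return shape here are illustrative assumptions):

```typescript
type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';

// Illustrative policy: tools blocked at each level (assumed, not the real config).
const BLOCKED_AT: Record<SafeModeLevel, Set<string>> = {
  normal: new Set(),
  cautious: new Set(),
  restricted: new Set(['Write', 'Edit', 'Bash']),
  lockdown: new Set(['Write', 'Edit', 'Bash', 'WebFetch']),
};

// Given the current level, return a predicate a PreToolUse-style hook
// can use to allow or block each tool call.
function createSafeModeGate(level: SafeModeLevel) {
  return (toolName: string): { allow: boolean; reason?: string } =>
    BLOCKED_AT[level].has(toolName)
      ? { allow: false, reason: `safe mode '${level}' blocks ${toolName}` }
      : { allow: true };
}
```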

SIGUSR2 dlq-recover control signal in start.ts.
tx mesh recover CLI with SIGUSR2 + message fallback.
Test mesh config with tight thresholds for quick testing.
docs/reliability.md — complete guide for all features.

Adds the ability to rewind recovery to any FSM state checkpoint,
not just the crash point. Core agent can now say "rewind-to: build"
to skip failed work and resume from a known-good state.

Checkpoint log (SQLite):
  Saves session IDs at every FSM state transition in the dispatcher's
  onWorkerComplete handler. Key: mesh_name + state_name → session_id.
  Lookup, list, GC, and clear operations.
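An in-memory sketch of that keying scheme (the real CheckpointLog is SQLite-backed; method names are illustrative):

```typescript
// Stand-in for the checkpoint table:
// key mesh_name + state_name → session ID of the last completed transition.
class CheckpointLogSketch {
  private checkpoints = new Map<string, string>();

  private key(mesh: string, state: string): string {
    return `${mesh}:${state}`;
  }

  // Called on each FSM state transition; later saves overwrite earlier ones.
  save(mesh: string, state: string, sessionId: string): void {
    this.checkpoints.set(this.key(mesh, state), sessionId);
  }

  // Used by rewind-to recovery to find a known-good session to resume.
  lookup(mesh: string, state: string): string | undefined {
    return this.checkpoints.get(this.key(mesh, state));
  }

  list(mesh: string): Array<{ state: string; sessionId: string }> {
    return [...this.checkpoints.entries()]
      .filter(([k]) => k.startsWith(`${mesh}:`))
      .map(([k, sessionId]) => ({ state: k.split(':')[1], sessionId }));
  }
}
```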

rewind-to front-matter:
  recover: true + rewind-to: <state> on a message looks up the
  checkpoint for that state and uses its session ID instead of
  the DLQ entry's crash-point session.

Three trigger paths:
  1. CLI: tx mesh recover <mesh> --rewind-to=build
  2. Message: recover: true + rewind-to: build front-matter
  3. SIGUSR2: {"action":"dlq-recover","mesh":"x","rewindTo":"build"}

tx mesh recover now shows available checkpoints before recovering.

Core prompt updated with Reliability & Recovery section teaching
the agent how to use recover, rewind-to, and check health.

mesh-builder skill updated with reliability front-matter fields.
docs/reliability.md updated with checkpoint log docs.

Core prompt now requires a 4-step workflow for recovery:
1. Diagnose (tx mesh health + tx mesh dlq)
2. Present options to user (resume vs rewind vs drop + checkpoints)
3. Get explicit confirmation
4. Execute chosen recovery

Never triggers recovery silently. The user sees what failed, why,
and which checkpoints are available before approving any action.

docs/reliability.md updated to match: human-initiated recovery
is the preferred path, automatic recovery only on startup crash.

Each reliability priority now has explicit human review steps:
1. Checkpoints + replay: checkpoint notification, replay approval, post-replay review
2. Metrics + tracking: threshold alerts, safe mode escalation/de-escalation approval
3. Retry-with-variation: failure notification, variation transparency, exhaustion review
4. Schema validation: failure notification, correction approval, partial pass handling
5. Agent classification: classification review, non-critical failure reporting, promotion decisions
6. Observability dashboard: anomaly alerts, trend review, cost gates, weekly digest

Core principle: "The system does work. The human makes decisions."
Core prompt updated with condensed human review gates checklist.

…gates

Documents all existing reliability features organized by nines level:
- Nine 1 (90%): SQLite WAL, worker retries, injection retries, routing correction
- Nine 2 (99%): Parity gate, FSM validation, mesh validator, identity gate, write gate
- Nine 2.5: Nudge detector, deadlock breaker, stale cleaner, quality iteration loops
- Nine 3 (99.9%): Circuit breaker, heartbeat, DLQ, SLI tracker, safe mode, checkpoints
- Nine 4 (99.99%): Roadmap items with human review gates

Each level includes a feature table (what/where) and explicit human review steps.

Adds features found across the codebase organized by nines level:
- Nine 1: graceful shutdown, usage policy recovery, recovery handler escalation
- Nine 2: manifest validator, guardrail config chain
- Nine 2.5: session suspend/resume, FSM persistence + backup, session store backfill
- Nine 3: rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking

Bash guard (write-gate.ts createBashHook) intercepts Bash redirects
(>, >>, tee) and validates target paths against write manifest.
Strike system: 1-2 errors with paths, 3+ kills worker.

… review

- Add summary table to Nine 3 (matching Nine 1/2/2.5 format)
- Add detailed explanations for all Nine 1/2/2.5 features
- Extract all human review gates to dedicated HUMAN_REVIEW.md
- Restructure roadmap into table + explanations


Copilot AI left a comment


Pull request overview

Adds a new src/reliability/ module and integrates it into TX mesh execution to provide DLQ recovery, circuit breaking, heartbeat-based stuck detection, SLI tracking, safe-mode enforcement, and FSM checkpoint-based rewind recovery. It also extends the CLI and documentation to expose health/DLQ/recovery workflows and introduces two new test meshes to exercise the features.

Changes:

  • Integrate ReliabilityManager into the worker dispatcher for spawn gating, heartbeat tracking, DLQ routing/recovery, safe-mode tool blocking, and FSM checkpoint logging.
  • Add new reliability primitives (DLQ, circuit breaker, heartbeat monitor, SLI tracker, safe mode, checkpoint log) and expose CLI commands (tx mesh health|dlq|recover).
  • Add documentation and test meshes for reliability workflows and configuration.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 13 comments.

Summary per file:

  • src/worker/dispatcher.ts — Initializes and wires reliability manager; adds safe-mode hook, heartbeat recording, spawn gating, DLQ routing/recovery triggers, FSM checkpoint saving.
  • src/reliability/sli-tracker.ts — New in-memory SLI tracker for success/failure rates, MTTR, and failure taxonomy.
  • src/reliability/safe-mode.ts — New safe-mode state machine for per-mesh/global tool/action restrictions based on SLI thresholds.
  • src/reliability/reliability-manager.ts — Coordinator tying together DLQ, circuit breaker, heartbeat, SLI, safe mode, and checkpoints; provides recovery and status APIs.
  • src/reliability/index.ts — Barrel exports for the reliability module.
  • src/reliability/heartbeat-monitor.ts — New heartbeat monitor for warn/stale/dead silence thresholds with callbacks.
  • src/reliability/dead-letter-queue.ts — New SQLite-backed DLQ with recovery modes and stats.
  • src/reliability/circuit-breaker.ts — New per-agent circuit breaker with SQLite checkpoint persistence.
  • src/reliability/checkpoint-log.ts — New SQLite-backed FSM checkpoint/session-id log enabling rewind-to recovery.
  • src/prompt/core.ts — Adds reliability & recovery guidance and the human-approval workflow to the core prompt.
  • src/cli/start.ts — Adds SIGUSR2 control action to trigger DLQ recovery through a running dispatcher.
  • src/cli/mesh.ts — Adds tx mesh health, tx mesh dlq, and tx mesh recover commands and --rewind-to parsing.
  • meshes/reliability-test/config.yaml — Adds linear 3-agent test mesh configured with tight reliability thresholds.
  • meshes/reliability-test/planner/prompt.md — Adds planner prompt for reliability-test mesh.
  • meshes/reliability-test/worker/prompt.md — Adds worker prompt for reliability-test mesh.
  • meshes/reliability-test/checker/prompt.md — Adds checker prompt for reliability-test mesh.
  • meshes/reliability-fsm/config.yaml — Adds FSM test mesh (analyze/build/verify/complete) for checkpoint/rewind testing.
  • meshes/reliability-fsm/analyst/prompt.md — Adds analyst prompt for reliability-fsm mesh.
  • meshes/reliability-fsm/builder/prompt.md — Adds builder prompt for reliability-fsm mesh.
  • meshes/reliability-fsm/verifier/prompt.md — Adds verifier prompt for reliability-fsm mesh.
  • docs/reliability.md — Adds comprehensive reliability documentation and operational workflows.
  • docs/guardrails.md — Documents reliability configuration and mechanisms in the guardrails guide.
  • docs/HUMAN_REVIEW.md — Adds human review gates and policies for reliability features.
  • .gitignore — Whitelists the two new reliability test meshes.
  • .claude/skills/mesh-builder/SKILL.md — Documents reliability-related front-matter fields for recovery/resume workflows.


Comment on lines +168 to +171
case 'half_open':
// Allow single probe request
return true;
}

Copilot AI Mar 11, 2026


In half_open state, canExecute() returns true for every call, which allows unlimited concurrent probe executions. That doesn’t match the “allow one probe request” behavior described in the header comment and in docs. Consider tracking an in-flight probe (or transitioning to open/closed after the first probe decision) so only one request is allowed until it succeeds/fails.
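One way to enforce a single in-flight probe, per this suggestion (a sketch with assumed names, not the module's actual code):

```typescript
type ProbeState = 'closed' | 'open' | 'half_open';

class HalfOpenProbe {
  state: ProbeState = 'half_open';
  private probeInFlight = false;

  canExecute(): boolean {
    if (this.state !== 'half_open') return this.state === 'closed';
    if (this.probeInFlight) return false; // only one probe at a time
    this.probeInFlight = true;
    return true;
  }

  // The probe's outcome decides the next state: success closes the
  // circuit, failure reopens it until the next cooldown.
  recordProbeResult(success: boolean): void {
    this.probeInFlight = false;
    this.state = success ? 'closed' : 'open';
  }
}
```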

Comment on lines +1197 to +1207
requeueMessage: (from: string, to: string, type: string, payload: Record<string, unknown>, extraFrontmatter?: Record<string, string>) => {
this.systemWriter.write({
from,
to,
type,
headline: (payload.headline as string) || 'DLQ recovery',
body: (payload.body as string) || '',
extraFrontmatter: { ...extraFrontmatter, ...Object.fromEntries(
Object.entries(payload).filter(([k]) => !['headline', 'body'].includes(k)).map(([k, v]) => [k, String(v)])
)},
});

Copilot AI Mar 11, 2026


requeueMessage writes session-id, resume-mesh, etc. into extraFrontmatter, but SystemMessageWriter.write() only inserts {msg-id, headline, body, filepath} into the SQLite queue payload. Since the dispatcher later reads nextMsg.payload['session-id'] / ['resume-mesh'], recovered messages won’t actually resume sessions or preserve mesh state. Consider ensuring these fields are also present in the queued payload (or change the dispatcher to parse frontmatter from source_file when deciding session continuation).

Comment on lines +5 to +10
* Monitor last output timestamps and escalate when stale.
*
* Stale detection levels:
* 1. Warning (60s no output): Log, could be thinking
* 2. Stale (120s no output): Inject nudge to worker
* 3. Dead (300s no output): Kill worker, route to DLQ

Copilot AI Mar 11, 2026


The module header says stale agents (120s silence) trigger a nudge injection, but HeartbeatMonitor only logs and exposes callbacks; and ReliabilityManager.bindDispatcher() currently just logs on stale. Either implement the nudge behavior at the stale threshold (via dispatcher bindings) or adjust the comment/docs to match the current behavior.

Suggested change
* Monitor last output timestamps and escalate when stale.
*
* Stale detection levels:
* 1. Warning (60s no output): Log, could be thinking
* 2. Stale (120s no output): Inject nudge to worker
* 3. Dead (300s no output): Kill worker, route to DLQ
* Monitor last output timestamps and surface health state so callers can react.
*
* Stale detection levels (as reported by this monitor):
* 1. Warning (60s no output): Mark as "warn" and log; worker may still be thinking
* 2. Stale (120s no output): Mark as "stale", log, and invoke callbacks so callers can nudge
* 3. Dead (300s no output): Mark as "dead", log, and invoke callbacks so callers can terminate / DLQ

Comment on lines +402 to +404
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.

- Automatic retry with exponential backoff

Copilot AI Mar 11, 2026


The DLQ section claims “automatic retry with exponential backoff”, but the implementation shown in this PR doesn’t include DLQ retry scheduling/backoff logic (DLQ is a persistence + recovery mechanism; worker retries are currently fixed-delay). Update the docs to reflect the actual behavior, or implement the described backoff mechanism.

Suggested change
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.
- Automatic retry with exponential backoff
Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually; the DLQ itself does not schedule retries or implement backoff.
- Integration with worker retries (fixed-delay; DLQ only after retries exhausted)


// Load config from config.yaml if exists
const fileConfig = this.loadConfigFromFile(workDir);
const merged = { ...fileConfig, ...config };

Copilot AI Mar 11, 2026


Config merging is shallow (const merged = { ...fileConfig, ...config }), so providing a partial nested override (e.g., { circuitBreaker: { cooldownMs: ... } }) will drop other circuitBreaker values from fileConfig. Consider a per-section merge (merged.circuitBreaker = { ...fileConfig.circuitBreaker, ...config?.circuitBreaker }, etc.) to avoid surprising config loss.

Suggested change
const merged = { ...fileConfig, ...config };
const merged: ReliabilityConfig = {
...fileConfig,
...config,
circuitBreaker: {
...(fileConfig?.circuitBreaker ?? {}),
...(config?.circuitBreaker ?? {}),
},
heartbeat: {
...(fileConfig?.heartbeat ?? {}),
...(config?.heartbeat ?? {}),
},
safeMode: {
...(fileConfig?.safeMode ?? {}),
...(config?.safeMode ?? {}),
},
dlq: {
...(fileConfig?.dlq ?? {}),
...(config?.dlq ?? {}),
},
sli: {
...(fileConfig?.sli ?? {}),
...(config?.sli ?? {}),
},
};

Comment on lines +1212 to +1219
// Recover any pending DLQ entries from previous crash
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});

Copilot AI Mar 11, 2026


recoverAll() is invoked unconditionally on startup. This will auto-recover any pending DLQ entries even if they were created by retry exhaustion (not just a crash), which conflicts with the “human approval required” guidance and can cause repeated recovery loops. Consider gating this behind an explicit config flag and/or only auto-recovering entries that were marked as crash-recovery eligible.

Suggested change
// Recover any pending DLQ entries from previous crash
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});
// Optionally recover any pending DLQ entries from previous crash.
// Controlled via WORKER_AUTO_RECOVER_DLQ env var to avoid reprocessing
// entries that may require human approval (e.g., retry exhaustion).
const shouldAutoRecoverDlq =
process.env.WORKER_AUTO_RECOVER_DLQ === 'true' ||
process.env.WORKER_AUTO_RECOVER_DLQ === '1';
if (shouldAutoRecoverDlq) {
const dlqRecovery = this.reliability.recoverAll();
if (dlqRecovery.length > 0) {
log.info('dispatcher', 'DLQ startup recovery', {
attempted: dlqRecovery.length,
succeeded: dlqRecovery.filter(r => r.success).length,
failed: dlqRecovery.filter(r => !r.success).length,
});
}

toAgent: agentId,
msgType: nextMsg?.type,
payload: nextMsg?.payload as Record<string, unknown>,
sourceFile: nextMsg?.source_file,

Copilot AI Mar 11, 2026


DLQ entries created on retry exhaustion aren’t being marked as such. DeadLetterQueue.add() decides recovery_mode using retry_count, but deadLetter() is called without passing the worker’s retryCount/maxRetries, so entries will default to retry_count=0 and often become auto-recoverable even after retries were exhausted. Pass machine.currentContext.retryCount (and ideally maxRetries) into the DLQ entry so exhausted cases correctly become manual and don’t auto-recover forever.

Suggested change
sourceFile: nextMsg?.source_file,
sourceFile: nextMsg?.source_file,
// Include retry metadata so DLQ can determine correct recovery mode
retryCount: machine.currentContext.retryCount,
maxRetries: machine.currentContext.maxRetries,
retry_count: machine.currentContext.retryCount,
max_retries: machine.currentContext.maxRetries,

Comment on lines +29 to +34
export interface SLIConfig {
/** How long to retain data in ms (default: 7 days) */
retentionMs: number;
/** Bucketing interval for rate calculations (default: 60000 = 1 min) */
bucketMs: number;
}

Copilot AI Mar 11, 2026


SLIConfig.bucketMs is defined and included in DEFAULT_CONFIG, but it’s never referenced. Either implement bucketing (e.g., to support rate over time windows without scanning all events) or remove the config field to avoid a misleading API surface.

const queue = new MessageQueue(queuePath);
const reliability = new ReliabilityManager(queue.getDb(), cwd);
const status = reliability.getStatus(300_000); // 5 min window


Copilot AI Mar 11, 2026


tx mesh health instantiates a fresh ReliabilityManager, but SLI, heartbeat health, and safe-mode state are currently in-memory only. As a result, this command will typically report 100% success with 0 events and default safe-mode/health, which is misleading. Either persist these metrics/state (SQLite tables, log-backed snapshots, etc.) or have mesh health query a running dispatcher (similar to the SIGUSR2 control path) so it reflects real runtime data.

Suggested change
// Note: ReliabilityManager keeps key state in-memory in the dispatcher.
// A freshly instantiated manager (as used here) may only see persisted data.
// If there are no recorded events, reporting "100% success" is misleading,
// so treat health as unavailable instead.
const sli: any = status.sli as any;
const totalEvents = sli && typeof sli.totalEvents === 'number' ? sli.totalEvents : undefined;
if (!totalEvents || totalEvents === 0) {
const warning = 'No reliability events recorded for this time window. ' +
'SLI/health/safe-mode state is unavailable outside a running dispatcher with in-memory state.';
if (flags.json) {
console.log(JSON.stringify({ ...status, warning, statusUnavailable: true }, null, 2));
return;
}
console.log();
console.log(chalk.bold('Reliability Health'));
console.log(chalk.dim('─'.repeat(50)));
console.log(chalk.yellow(warning));
return;
}

Comment on lines +77 to 85
} else if (arg.startsWith('--rewind-to=')) {
flags.rewindTo = arg.split('=')[1];
} else if (arg === '--rewind-to') {
// Next arg will be picked up as a positional, but we handle it here
const idx = args.indexOf(arg);
if (idx < args.length - 1 && !args[idx + 1].startsWith('-')) {
flags.rewindTo = args[idx + 1];
}
}

Copilot AI Mar 11, 2026


parseFlags() uses args.indexOf('--rewind-to') while iterating, which returns the first occurrence and can mis-read the value if flags repeat or ordering changes. Use an indexed loop (for i=0; i<args.length; i++) to reliably read the next arg.
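The fix can be sketched like this (function name assumed; the real parseFlags() handles more flags than shown here):

```typescript
// Indexed loop reliably reads the value following `--rewind-to`,
// even if flags repeat or ordering changes (unlike args.indexOf,
// which always returns the first occurrence).
function parseRewindTo(args: string[]): string | undefined {
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg.startsWith('--rewind-to=')) {
      return arg.slice('--rewind-to='.length);
    }
    if (arg === '--rewind-to' && i + 1 < args.length && !args[i + 1].startsWith('-')) {
      return args[i + 1];
    }
  }
  return undefined;
}
```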

@eighteyes
Owner Author

@copilot open a new pull request to apply changes based on the comments in this thread

@eighteyes eighteyes merged commit c0845c5 into main Mar 11, 2026
4 checks passed

Copilot AI commented Mar 11, 2026

@eighteyes I've opened a new pull request, #3, to work on those changes. Once the pull request is ready, I'll request review from you.


4 participants